Processing instruction addressed by received remote instruction and generating remote instruction to respective output port for another cell

ABSTRACT

Embodiments of the invention relate to a processing cell for use in computing systems. Generally, a processing cell generates remote instructions to be received and processed by at least one other processing cell. A processing cell may include a program counter, an instruction memory, and appropriate elements such as a branch lookup, a branch unit, etc. Alternatively, the processing cell may include a state machine that replaces the program counter and the instruction memory. Embodiments of the invention are able to support the VLIW mode, the MIMD mode, a mixture of both modes of execution, etc.

FIELD OF THE INVENTION

The present invention relates generally to computing systems and, more specifically, to processing cells for use in such systems.

BACKGROUND OF THE INVENTION

Traditionally, to control computations on a microprocessor, the microprocessor is provided with a centralized instruction-issue unit and a branch unit. The instruction-issue unit issues instructions that control the cycle-by-cycle operations of the microprocessor's resources, while the branch unit steers execution in time, directs the flow of control, determines the sequence of instructions that should be issued, etc.

As chip density increases, emerging devices have the capacity to accommodate huge numbers of functional units, which can potentially deliver much higher performance than current devices. As the number of functional units, especially on programmable devices, increases, efficiently and flexibly controlling these devices raises various issues. In many situations, the centralized point of control in traditional microprocessors with branch units is inadequate for managing this vastly increased number of functional units. For example, to exploit thread-level parallelism, a computing platform has to track multiple flows of control. Traditional centralized architecture, with its single flow of control, is unable to do this.

Conventional MIMD (multiple instructions multiple data) machines also have limitations in supporting thread-level parallelism. These machines usually limit each thread of execution to a microprocessor because control between different processors of a MIMD machine is generally so decoupled as to make it difficult to statically orchestrate their execution. A highly-parallel thread is usually unable to make full use of parallelism because of insufficient hardware resources in each MIMD processor, resources that are normally fixed in hardware. Dynamically spawning the work to other processors on a MIMD machine is usually done at very coarse granularity. This is due to high overheads arising from dynamic coordination that is needed when a single logical thread is split into multiple actual threads, each running on a different processor of a MIMD machine. This misses opportunities for exploiting parallelism and efficient use of computing resources.

Multi-threaded control architectures, such as SMT (simultaneous multi-threading), support multiple flows of control that share a common pool of functional units, allow sharing of functional unit resources across multiple threads of control, etc. However, they usually adopt a centralized point of control and dynamic instruction issue coordination that have problems with implementation and scaling. As a result, they are generally unable to accommodate either a larger number of simultaneously executing threads or a large number of functional units.

Distributing control information from a centralized control becomes harder with larger, faster chips. With faster clock speed, there is less time for signals to propagate each cycle. With smaller silicon having narrower and taller wires, signal propagation speed along these wires deteriorates. Under centralized control architecture, all these signals need to be brought to the central point, which causes a scaling bottleneck.

Based on the foregoing, it is desirable that mechanisms be provided to solve the above deficiencies and related problems.

SUMMARY OF THE INVENTION

The present invention, in various embodiments, is related to a processing cell for use in computing systems. Generally, a processing cell generates branch commands or instructions to be received and processed by at least one other processing cell. A processing cell may be instruction-based, in which case it includes a program counter, an instruction memory, and appropriate elements such as a branch lookup, a branch unit, an ALU, etc., for computations. Alternatively, the processing cell is state-machine based, which is comparable to an instruction-based cell, but includes a state machine that replaces the program counter and the instruction memory. Embodiments of the invention are able to support at least the VLIW (Very Long Instruction Word) mode, the MIMD (Multiple Instructions Multiple Data) mode, and a mixture of both modes of execution.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements and in which:

FIG. 1A shows a processing cell with an instruction-based control, in accordance with an embodiment;

FIG. 1B shows a processing cell with a state-machine based control, in accordance with an embodiment;

FIG. 2 shows how processing cells are re-allocated between logical threads, in accordance with an embodiment;

FIG. 3 shows a processing system using processing cells, in accordance with an embodiment;

FIG. 4 shows a first piece of programming code to be implemented on the processing system of FIG. 3, in accordance with an embodiment;

FIG. 5 shows a processing system of FIG. 3 implemented with the programming code of FIG. 4, in accordance with an embodiment;

FIG. 6 shows a second piece of programming code to be implemented on the processing system of FIG. 3, in accordance with an embodiment;

FIG. 7 shows a processing system of FIG. 3 implemented with the programming code of FIG. 6, in accordance with an embodiment;

FIG. 8 shows a parallel program to be implemented in a processing system using processing cells, in accordance with an embodiment;

FIG. 9 shows a processing system executing the program in FIG. 8, in accordance with an embodiment; and

FIG. 10 shows a schedule used in the example of FIGS. 8 and 9, in accordance with an embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the invention.

Processing Cell—Instruction Base

FIG. 1A shows a processing cell 100 with an instruction-based control, in accordance with an embodiment. Processing cell 100 may be referred to as a processing element, a processor, a processor with enhanced features, or their equivalents. Processing cell 100 includes a branch combiner 110, an operand un-format 120, a branch lookup 130, an instruction memory 140, a program counter 150, a branch unit 160, an operand format 170, dedicated functional units 180, a plurality of AND gates 190, and a plurality of latches 194.

Branch combiner or input combiner 110 receives information, e.g., commands, instructions, etc., normally from other processing cells. These commands or instructions that are received from and/or sent to other processing cells may be referred to as remote commands or remote instructions. Branch combiner 110 may merge these remote commands as appropriate. Inputs to branch combiner 110 generally come from outputs, e.g., AND gates 190 and latches 194, of other processing cells. Depending on implementations, branch combiner 110 can be an OR gate, which relies on static scheduling to ensure that collisions will not happen at this OR gate or that any collision is pre-planned and OR-ing colliding branch commands results in a valid, desired branch command. Alternatively, branch combiner 110 includes an intelligent element that, based on defined rules and/or priorities, selects a desired input. In general, branch combiner 110 selects the highest priority request for further propagation into processing cell 100. Prioritization may be fixed, i.e., requests of some inputs have higher priority than those of other inputs, or may be dynamic, such as in a round-robin scheme that rotates the highest priority between different inputs. Alternatively, each input command is tagged with a priority. Branch combiner 110 then selects a command tagged with the highest priority.
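The combiner variants described above can be sketched in software. The following Python fragment is only an illustration under stated assumptions: the names or_combine, fixed_priority_select, and RoundRobinCombiner are hypothetical, a command is modeled as an integer (with 0 or None meaning "no command"), and nothing here is taken from an actual hardware implementation.

```python
from typing import Optional, Sequence


def or_combine(inputs: Sequence[int]) -> int:
    """OR-gate combiner; 0 models an idle input (assumption).

    Correct only if static scheduling guarantees that at most one input
    is non-zero, or that OR-ing colliding commands is intended.
    """
    result = 0
    for word in inputs:
        result |= word
    return result


def fixed_priority_select(inputs: Sequence[Optional[int]]) -> Optional[int]:
    """Fixed prioritization: a lower input index has higher priority."""
    for word in inputs:
        if word is not None:
            return word
    return None


class RoundRobinCombiner:
    """Dynamic prioritization: the highest priority rotates between inputs."""

    def __init__(self, n_inputs: int) -> None:
        self.n = n_inputs
        self.top = 0  # index of the input that currently has top priority

    def select(self, inputs: Sequence[Optional[int]]) -> Optional[int]:
        for offset in range(self.n):
            idx = (self.top + offset) % self.n
            if inputs[idx] is not None:
                # The input after the winner gets top priority next time.
                self.top = (idx + 1) % self.n
                return inputs[idx]
        return None
```

The OR variant needs no state and no arbitration logic, which is why it depends on the schedule being collision-free; the round-robin variant trades a small amount of state for fairness between inputs.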

When a valid branch command is received, operand un-format 120 decodes and parses the command to extract a branch target tag. In an embodiment, the branch target tag is “virtual” in that it indirectly references instruction memory 140. The virtual branch target tag is used as an input to branch lookup 130. If the lookup succeeds, branch lookup 130 returns a local, physical branch target address that directly references instruction memory 140. This physical branch target address is then inserted into program counter 150, causing execution of the processing cell to jump to that physical branch target address and continue from there. In addition to extracting the branch target tag, operand un-format 120 makes the full content of a branch command accessible to dedicated functional units 180, so that information carried in an incoming branch command can be incorporated into local computations.

Normally, an externally initiated branch command has higher priority than a local branch or the normal increment of the local program counter 150. Thus, a branch command from outside a processing cell 100 causes control flow to jump to the branch-tag-specified location. In one mode of usage, a processing cell 100 sits in an idle loop while waiting for external branch commands, with loop-back implemented with a local branch instruction. Thus, having the externally initiated branch command take precedence over a local branch and the normal increment of the local program counter serves to initiate new computation at the processing cell. In other uses, such as in some parallel searches, the ability of an externally initiated branch command to interrupt local execution may be used to abort local computations. For example, multiple threads are employed with each working on a different portion of the overall search. Once a solution is found, the problem is solved and threads that are still searching should be aborted. An external branch can be used to implement the abort.

Using a virtual branch target tag confers the flexibility of placing instructions in each processing cell independently. In one usage of this invention, a branch command is multicast to multiple target processing cells that collaborate on a computation. By using a virtual branch target tag and performing a lookup to find the actual location of the branch target instruction in local memory 140 of each target processing cell, the physical location of the target instruction can be different in each processing cell 100, while accommodating a common branch target name that is multicast to all the target processing cells.

In contrast, in systems that do not use a virtual branch target tag and branch target lookup, multicasting a branch instruction to multiple target processing cells requires that the target instructions be located at the same memory address in every target processing cell. In general, each processing cell executes a different number of instructions between branch target instructions. Aligning VLIW branch targets on different processing cells thus pads instruction memory 140 with no-op (no operation) instructions, resulting in inefficient use of instruction memory 140. By using a virtual branch target tag and translation through table lookup, various processing cells, e.g., in the same VLIW cluster, receive the same virtual branch target name, but do not necessarily branch to the same local branch target address. The layout of instruction memory 140 can be different for each processing cell and each processing cell can therefore use its instruction memory 140 efficiently.

In some embodiments, branch lookup table 130 is an associative memory that contains only desirable entries, e.g., entries in which a branch results in useful work within the processing cell. Alternatively, branch lookup table 130 is a table indexed with the branch target name and contains the branch target address. Lookup table 130 thus translates a virtual branch target name to a physical, local branch target address.
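As a rough software analogy of the translation just described, the sketch below multicasts one virtual branch target tag to two cells whose instruction memories are laid out differently. The class name LookupCell, the tag strings, and the addresses are hypothetical. Ignoring a command whose tag is not in the table is an assumption of this sketch; the text above only describes what happens when the lookup succeeds.

```python
from typing import Dict, Iterable, List


class LookupCell:
    """Toy cell holding a branch lookup table and a program counter."""

    def __init__(self, name: str, branch_lookup: Dict[str, int]) -> None:
        self.name = name
        self.branch_lookup = dict(branch_lookup)  # virtual tag -> local address
        self.pc = 0

    def receive_remote_branch(self, tag: str) -> bool:
        """Translate the virtual tag; jump only if the lookup succeeds."""
        address = self.branch_lookup.get(tag)
        if address is None:
            return False  # assumption: an unknown tag is simply ignored
        self.pc = address
        return True


def multicast(cells: Iterable[LookupCell], tag: str) -> List[str]:
    """Send one virtual tag to every cell; return names of cells that jumped."""
    return [cell.name for cell in cells if cell.receive_remote_branch(tag)]
```

Because each cell holds its own table, the same tag can map to address 12 in one cell and address 40 in another, so no cell has to pad its instruction memory to line up branch targets.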

Instruction memory 140 holds instructions to be executed. Normally, instructions in instruction memory 140 are in the form of a byte or a word, and include a field from which the instructions are decoded. Instructions issued from instruction memory 140 generally control dedicated functional units 180 that are solely under the control of a processing cell or shared functional units 198 that are jointly controlled by multiple processing cells.

Program counter 150, like program counters in conventional microprocessors, keeps track of the next instruction to be issued. In general, program counter 150 is incremented to execute instructions one after another in the order laid out in instruction memory 140. However, when a branch instruction is encountered, program counter 150 points to a location specified by that branch instruction. Typically, a branch instruction has a field that provides program counter 150 with a value determining the branch location.

Branch unit 160 decides a destination for a branch command, and supports both local and remote commands. In general, local commands are executed within processing cell 100 while remote commands are sent outside of processing cell 100, e.g., to another processing cell. Branch unit 160 directs a local branch to program counter 150 for execution, and composes a remote branch command that is eventually sent to AND gates 190 to be output to one or more other cells. A remote branch command is normally passed to operand format 170. Generally, instruction memory 140, program counter 150, and branch unit 160 control instructions processed by processing cell 100 and may be referred to as instruction controller 125.

Operand format 170, branch unit 160, and dedicated functional units 180 form remote branch commands. Consequently, branch unit 160, operand format 170, and functional units 180, as a whole, may be referred to as a remote command or remote instruction generator 155. Operand format 170 assembles bits supplied by the branch unit 160 and dedicated functional units 180 into a branch command. In an embodiment, branch unit 160 supplies the bits representing the virtual branch target tag, while dedicated functional units 180 supply operands carried with the remote branch command. In another embodiment, branch unit 160 makes the decision to issue a remote branch, but dedicated functional units 180 supply the branch target tag. Remote branch commands generated through operand format 170 propagate through AND gates 190 controlled by output steering bits on lines 188 supplied by instructions from instruction memory 140, and arrive at the appropriate processing cell destinations.

Data received from another processing cell may be used in part or in whole to form a new branch command. Operand un-format 120 makes that data accessible to dedicated functional units 180. Alternatively, a new branch command may be assembled from scratch. Once formed, a new branch command may be sent to one or more output ports, e.g., AND gates 190, under instruction or data control. In an embodiment, the instruction bits, in the form of the output steering bits on lines 184, directly specify the selected destinations. Alternatively, dedicated functional units 180 provide the destinations, through control bits on lines 182. In both cases, an output steering control mux 165 selects between instruction bits on line 182 and data bits on line 184 as the output steering bits 188 for controlling the output AND gates 190. In the absence of hardware support for control of output ports by a data value from dedicated functional units 180, the same effect can be achieved indirectly by using the data as the selector in a case statement. Program instructions in instruction memory 140 are set up for the case statement so that each case invokes a remote branch instruction with the appropriate output ports enabled.

Dedicated functional units 180 process instructions, e.g., by performing loads, stores, arithmetic operations, etc., within a processing cell.

The outputs of AND gates 190 may be referred to as output ports, and the number of AND gates 190 varies depending on the topology of processing cell 100. Adding or removing AND gates 190 adds or removes output ports of processing cell 100. Each AND gate 190 determines whether a message on line 186 propagates to a line 192 and thus output 196. If an AND gate 190 is enabled, then the message can propagate through that AND gate 190 and latch 194, to its corresponding output 196. Conversely, if an AND gate 190 is disabled, then the message cannot propagate through that AND gate. Each AND gate 190 is controlled, i.e., enabled, disabled, configured, etc., by a bit-vector or output steering bits on lines 188, supplied by instructions read out of instruction memory 140 or supplied by dedicated functional units 180. Configuring, e.g., setting/resetting, the bit corresponding to an AND gate enables/disables that AND gate. For example, four bits B1, B2, B3, and B4 of a bit vector V1 correspond to four AND gates 190(1), 190(2), 190(3), and 190(4), respectively. If bit B1 is set while bits B2, B3, and B4 are reset, then only AND gate 190(1) is enabled while AND gates 190(2), 190(3), and 190(4) are disabled. As a result, because only AND gate 190(1) is enabled, data is only sent to line 192(1) and to the input of another processing cell connected to this AND gate 190(1).
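The output steering described above amounts to masking one message with a bit vector. The following Python sketch models that behavior; the function names select_steering_bits and drive_outputs are hypothetical, None stands for an output port that carries no message, and the boolean selector is an assumed stand-in for the control of the output steering mux.

```python
from typing import List, Optional, Sequence


def select_steering_bits(use_data_bits: bool,
                         instruction_bits: Sequence[int],
                         data_bits: Sequence[int]) -> Sequence[int]:
    """Model of the steering mux: pick instruction- or data-supplied bits."""
    return data_bits if use_data_bits else instruction_bits


def drive_outputs(message: str,
                  steering_bits: Sequence[int]) -> List[Optional[str]]:
    """One AND gate per output port: the message passes only where the bit is set."""
    return [message if bit else None for bit in steering_bits]
```

With the bit vector (1, 0, 0, 0) from the example above, only the first port carries the message; setting several bits multicasts the same message to several neighboring cells.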

Latches 194 latch data on line 192 to line 196, and are useful in pipelining, a technique that allows for high clock speed. Pipelining divides a long combinational logic path into several segments or stages, separated by latches. As a result, signals only have to pass through shorter combinational paths between neighboring latches, resulting in faster system clocks and thus higher throughput because multiple instances of the messages traversing a processing cell can be in progress, each occupying a different stage of the pipeline. A latch 194 thus allows more messages to be in flight at the same time, each in a different stage of the pipeline. Latches 194 may be eliminated to reduce the number of clock cycles. Conversely, additional levels of latches may be added to a processing cell as appropriate, such as to allow for higher clock speed.

Outputs on lines 196 provide information or messages to another processing cell, e.g., via branch combiner 110 of that processing cell.

Shared functional units 198 are jointly controlled by more than one processing cell such as when some resources used in a system utilizing the processing cells are shared between the processing cells. For example, a FIFO (first-in-first-out) queue connects two neighboring processing cells in which one processing cell inserts data into the FIFO while another processing cell removes data from it. Both processing cells may also monitor the status of the FIFO queue, e.g., how full or empty it is, etc.

Processing Cell—State-Machine-Based Control

FIG. 1B shows a processing cell 200 with a state-machine based control, in accordance with an embodiment. Processing cell 200, like processing cell 100, may be referred to as a processing element, a processor, a microprocessor with enhanced capabilities, or their equivalents. Processing cell 200 is comparable to processing cell 100, but includes a state machine 225 that replaces program counter 150 and instruction memory 140 of processing cell 100. As a result, processing cell 200, besides state machine 225, comprises a branch combiner 210, an operand un-format 220, dedicated functional units 280, an operand format 270, AND gates 290 and latches 294. Branch combiners 110 and 210, operand un-formats 120 and 220, branch units 160 and 260, operand formats 170 and 270, dedicated functional units 180 and 280, AND gates 190 and 290, and latches 194 and 294, are comparable. Like processing cell 100, processing cell 200 may also control shared functional units 298 that are comparable to shared functional units 198.

State machine 225 includes next state logic 242, current state logic 246, and control decode logic 244, that, together, provide the current state and future state of state machine 225. Current state logic 246 encodes the present state, determines what should be done presently, and provides the context from which next state logic 242 determines future control actions. Control decode logic 244 takes as input the current state's value, and from that generates control signals that are used to control the operations of branch unit 260, dedicated functional units 280, shared functional units 298, and output AND gates 290. Next state logic 242, considering the current state and branch commands, decides the state that state machine 225 should be in next. Branch commands may be local, e.g., conveyed by local branch unit 260, or remote, e.g., externally generated by another processing cell and arriving via branch combiner 210 and operand un-format 220.
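A minimal software sketch of the three parts of such a state machine is given below. The states IDLE, WORK, and SEND, the control signal names, and the class name StateMachineCell are all hypothetical. Giving a remote branch precedence over a local branch mirrors the priority described for the instruction-based cell and is an assumption here.

```python
from typing import Dict, Optional

# Control decode: current state -> control signals (hypothetical signals).
CONTROL: Dict[str, dict] = {
    "IDLE": {"alu_enable": False, "out_ports": (0, 0)},
    "WORK": {"alu_enable": True, "out_ports": (0, 0)},
    "SEND": {"alu_enable": False, "out_ports": (1, 0)},
}

# Default successor of each state when no branch command is present.
DEFAULT_NEXT: Dict[str, str] = {"IDLE": "IDLE", "WORK": "SEND", "SEND": "IDLE"}


class StateMachineCell:
    def __init__(self) -> None:
        self.state = "IDLE"  # current state register

    def control_decode(self) -> dict:
        """Generate control signals from the current state only."""
        return CONTROL[self.state]

    def next_state(self, remote_branch: Optional[str] = None,
                   local_branch: Optional[str] = None) -> str:
        """Next-state logic: consider the current state and branch commands."""
        if remote_branch is not None:  # assumption: remote wins over local
            return remote_branch
        if local_branch is not None:
            return local_branch
        return DEFAULT_NEXT[self.state]

    def clock(self, remote_branch: Optional[str] = None,
              local_branch: Optional[str] = None) -> dict:
        """Advance one clock cycle and return the new control signals."""
        self.state = self.next_state(remote_branch, local_branch)
        return self.control_decode()
```

In this sketch the idle loop is the IDLE state looping to itself, and an incoming remote branch naming WORK plays the role that a virtual branch target tag plays in the instruction-based cell.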

In the following text, the term processing cell or processing cell 100 is also applicable to processing cell 200.

Functions of a Processing Cell

A processing cell may implement local control that performs computation in response to incoming requests. A processing cell can also send branch commands to, and thus invoke execution on, other processing cells. A collection of processing cells that collaborate in a tightly coupled, statically scheduled manner may operate as one logical thread. Execution on one processing cell with a known fixed timing relationship with execution on another cell allows coordination between the cells to be done statically and thus avoids run-time synchronization costs.

Sometimes, parallelism in an application comes in the form of thread parallelism. In these situations, static coordination between the multiple threads is not possible. Instead, multiple logical threads of execution are desired, with each thread of control making its own branching decisions such that it is impossible to predetermine a fixed timing relationship between different threads' execution. Various embodiments of the present invention can accommodate thread level parallelism. Each processing cell can run a separate thread of control. A processing cell has sufficient local control resources to implement local control sequencing and local branching, and can therefore operate independently.

By equipping each processing cell with flexible remote branch initiation and reception capabilities, the allocation of resources to logical threads, at the granularity of a processing cell, can change rapidly with little run-time overhead. Sending an appropriate remote branch command to a processing cell enables a logical thread to start using that processing cell to execute part of its workload. The remote branch command specifies as its branch target tag a value that refers to code that the thread wants executed on that processing cell.

FIG. 2 shows a processing system or device 299 with 36 processing cells laid out in a two-dimensional grid to illustrate how processing cells are re-allocated rapidly between logical threads. A coordinate (x, y) refers to a processing cell at a row x and a column y. At time t, five logical threads A, B, C, D, and E are running on device 299. Logical thread A uses twelve processing cells (0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3), (2,0), (2,1), (2,2), and (2,3). Logical thread B uses one processing cell (3,1). Logical thread C uses 13 processing cells, (3,0), (3,2), (3,3), (4,0), (4,1), (4,2), (4,3), (4,4), (5,0), (5,1), (5,2), (5,3) and (5,4). Logical thread D uses six processing cells (0,4), (0,5), (1,4), (1,5), (2,4), and (3,4), and logical thread E uses four processing cells (2,5), (3,5), (4,5), and (5,5). At the end of time t, logical thread A finishes using processing cells (0,3), (1,3), and (2,3), while logical thread C finishes using processing cells (4,4) and (5,4). Depending on implementations, those processing cells that are no longer used in a logical thread may execute a halt instruction after their last instruction.

For further illustration purposes, logical thread D expands its execution to include execution on processing cells (0,3), (1,3) and (4,4), and does this by sending a remote branch command, at time t, from processing cell (1,4) to processing cells (0,3) and (1,3), and from processing cell (3,4) to processing cell (4,4). Assuming that a remote branch command takes one cycle to reach a neighboring cell, and each cell's output is directly connected to the nearest eight neighbors, the remote branch commands from cell (1,4) arrive at cells (0,3) and (1,3) at time (t+1) while the remote branch command from processing cell (3,4) arrives at processing cell (4,4) at time (t+1). These target processing cells then begin to execute code that is part of logical thread D.
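The timing in this example can be checked with a few lines of Python. The sketch assumes, as the text does, that each cell is wired to its eight nearest neighbors and that a remote branch command takes one cycle to reach a neighbor; the function names is_neighbor and arrival_time are hypothetical, and time t is represented by the integer 0.

```python
from typing import Tuple

Coord = Tuple[int, int]  # (row, column)


def is_neighbor(src: Coord, dst: Coord) -> bool:
    """True if dst is one of the eight cells surrounding src."""
    d_row = abs(src[0] - dst[0])
    d_col = abs(src[1] - dst[1])
    return max(d_row, d_col) == 1


def arrival_time(src: Coord, dst: Coord, t_send: int,
                 hop_latency: int = 1) -> int:
    """Cycle at which a remote branch sent at t_send reaches a neighbor."""
    if not is_neighbor(src, dst):
        raise ValueError("destination is not directly connected to the source")
    return t_send + hop_latency
```

Cell (0,3) is a diagonal neighbor of cell (1,4), so it is reached in one cycle just like the horizontally adjacent cell (1,3). A cell that is not a direct neighbor would have to be reached through a relay, which this sketch does not model.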

A processing cell can receive information and pass it to the next cell without processing that information. For example, the receiving cell, upon receiving the information, issues a remote branch instruction to relay the information to a destination cell along a propagation chain. As an example, at least two cells are stacked together so that, as appropriate, one cell is responsible for passing the information, and the other cell is responsible for processing, e.g., performing arithmetic operations on the information.

The Remote Branch Command

A remote branch command may be useful as a means to implement remote function invocation. Sending a remote branch command to a target processing cell invokes the function call. Input data in the form of function call parameters may be supplied, e.g., bundled with the remote branch command. Typically, a function invocation is marked by a call followed by a return. The return serves as a control-flow event that marks the end of the invocation so that appropriate subsequent execution can begin. The return may also serve to convey return results. Embodiments of this invention allow invocation return to happen in one of several ways.

When it is possible to predetermine an upper bound on the execution time of the called function, the invocation return may be implicit and silent, i.e., with no explicit actions triggering subsequent execution. Instead, the invoking thread can time the execution and initiate subsequent execution upon reaching the predetermined execution time upper bound. The processing cell executing the invoked function simply finishes what it is asked to do, and continues with other tasks or simply halts. Return results may be left by the callee in memory locations accessible by both the caller and the callee. Normally, the caller accesses these locations after waiting for the invoked function's predetermined maximum execution time. The caller, while waiting, can perform other tasks.

Alternatively, the target processing cell sends a remote branch command to the caller when it is done executing the invoked function. The arrival of this second remote branch command at a caller processing cell triggers execution of code that should begin after the function invocation. Returned results may be bundled with this remote branch command.

For a called function to be invoked from multiple callers, in an embodiment, each call supplies a return address with a remote branch command that makes the function call. The return address is used to determine where to send the reply remote branch command. For example, the return address is used to select which output ports to send the reply command to. In the case of processing cells connected by a switched network connection, part of the return address may be used as a “return route” on the reply and used by the switches to dynamically route the reply to its destination. In an embodiment, operand un-format 120 makes the return address available to dedicated functional units 180, which use this return address to form the appropriate reply by operand format 170, and to select appropriate output ports.
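The call-and-reply pattern with a return address might be modeled as follows. This is only a sketch under assumptions: the RemoteBranch record, its field names, the port name out_e, and the squaring function are hypothetical, and the return address is split here into an output port plus a tag naming the code the caller resumes at.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RemoteBranch:
    tag: str                           # virtual branch target tag
    operands: Tuple[int, ...] = ()     # data bundled with the command
    return_port: Optional[str] = None  # assumed: output port toward the caller
    return_tag: Optional[str] = None   # assumed: code the caller resumes at


def square_callee(cmd: RemoteBranch) -> Tuple[Optional[str], RemoteBranch]:
    """Run the invoked function and build the reply remote branch.

    Returns the output port selected from the return address, plus the
    reply command that carries the result back to the caller.
    """
    (x,) = cmd.operands
    reply = RemoteBranch(tag=cmd.return_tag, operands=(x * x,))
    return cmd.return_port, reply
```

Because the callee takes the reply destination from the incoming command rather than from its own code, the same function can serve callers in different positions without modification.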

Processing Cell Connections

Processing cells may be connected in various ways. However, the invention is not limited to any one way of connection. Wires may directly connect processing cells to form hardwired interconnections. For example, a processing cell may have four sets of outputs, each set being connected to a set of inputs of one of its four north, west, south, and east neighboring cells. Alternatively, field programmable wires that are statically reconfigurable to form different interconnections may connect processing cells. This uses programmable wire technologies in which programmable switches, usually in the form of a pass gate, separate segments of wires. These switches are then controlled by RAM based configuration bits that determine whether switches are closed or open. Using reconfigurable wires allows the neighbors of each processing cell to be changed through reprogramming, which is helpful in situations such as when a processing cell may need to interact with a different set of other processing cells, e.g., in running different applications or during different phases of an application.

Processing cells may be connected through interconnection switches. The interconnection switches route each branch command dynamically, using either additional destination or route information carried on each branch command, or internal state information kept at each interconnection switch.

A processing cell may be implemented to form an interconnection switch, in which this processing cell runs a program that forwards an incoming command to one or more destinations.

FIRST EXAMPLE USING PROCESSING CELLS

FIG. 3 shows a processing system 300 that includes three processing cells 310, 320, and 330 being embodiments of a processing cell 100 and/or 200, in accordance with an embodiment. Processing cell 320 is to the west of processing cell 330 while processing cell 310 is to the west of processing cell 320. For illustration purposes, system 300 implements an exemplary piece of programming code 400 in FIG. 4.

FIG. 4 shows a piece of programming code 400 that includes four basic blocks B410, B420, B430, and B440. Conceptually, a basic block includes instruction code that is to be executed sequentially and without a transfer of control. Once execution begins at a basic block, all instructions within the basic block will be executed. Further, a basic block starts after a branch command or at the target of a branch command, and ends before another branch command or another branch command target.
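For readers who prefer code, the conventional "leader" method for splitting a linear instruction list into basic blocks is sketched below. It is a generic textbook technique, not a procedure taken from this description. The instruction encoding (a text string plus an optional branch target index) and the function name basic_blocks are assumptions, and in this sketch a branch instruction is counted as the last instruction of its block.

```python
from typing import List, Optional, Tuple

Instr = Tuple[str, Optional[int]]  # (text, branch target index or None)


def basic_blocks(instrs: List[Instr]) -> List[Tuple[int, int]]:
    """Return half-open index ranges [start, end), one per basic block."""
    n = len(instrs)
    leaders = {0} if n else set()
    for i, (_, target) in enumerate(instrs):
        if target is not None:
            leaders.add(target)      # the target of a branch starts a block
            if i + 1 < n:
                leaders.add(i + 1)   # so does the instruction after a branch
    starts = sorted(leaders)
    return list(zip(starts, starts[1:] + [n]))
```

Applied to a hypothetical if/else sequence (a conditional branch to the else part, and a jump over it at the end of the then part), the function yields four blocks, the same count as basic blocks B410 to B440.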

In FIG. 4's example, lines 414 to 419 constitute basic block B410 that ends with a branch, e.g., an “if” statement on line 419. Lines 420 to 422 constitute basic block B420, which starts after the “if” statement on line 419 and ends before the “else” statement on line 423. Lines 424 to 426 constitute basic block B430, which starts after the “else” statement on line 423 and ends before the end of the “else” block on line 427, and line 428 constitutes basic block B440. For illustrative purposes, lines 414, 420, and 426 are executed in processing cell 310. Lines 422 and 428 are executed in processing cell 320, and lines 416, 418, and 424 are executed in processing cell 330.

FIG. 5 shows a system 500 illustrating how code 400 in FIG. 4 is implemented in processing cells 310, 320, and 330 of system 300. Code 400 in FIG. 4 constitutes a logical thread that is executed on these three processing cells 310, 320, and 330. Instruction blocks bb1 of processing cell 330, bb1′ of processing cell 320 and bb1″ of processing cell 310 correspond to basic block B410. Similarly, blocks bb2, bb2′, and bb2″ correspond to basic block B420; blocks bb3, bb3′, and bb3″ correspond to basic block B430; and blocks bb4, bb4′, and bb4″ correspond to basic block B440.

FIG. 5 shows that, besides the original code in FIG. 4, cells 310, 320, and 330 have additional “coordination” code for coordinating execution of code 400. For example, the “halt” instruction on lines b104, b108, b204, etc., the “rbr” (remote branch) instruction on lines b202, b206, b302, etc., and the “lbr” (local branch) instruction on lines b310, b314, b318, etc., are “control coordination” code. A “halt” instruction halts execution of the program.

A “rbr” instruction sends a remote branch command to one or more processing cells, with a virtual branch target tag that identifies the code to be invoked at the destination(s). The “rbr” instruction may also carry additional data. In this example, because processing cell 330 controls execution of processing cells 320 and 310 when execution is initiated for basic blocks B410, B420, and B430, processing cell 330 is the origin of several chains of remote branch commands. For example, processing cell 330 issues the command “rbr (out_w, bb1′)” on line b302, and sends this command to processing cell 320 on its west. The parameter “bb1′” in the command specifies the virtual address, or block, at which the target processing cell is to start executing.

The “lbr addr” instruction is a local branch instruction that transfers local execution to an address “addr” in the same processing cell.

Coordination code is added during implementation of code 400 onto processing cells 310, 320, and 330. Different entities such as a system engineer, a compiler, a computer, etc., may implement code 400 onto processing cells 310, 320, and 330. In coming up with the actual code that runs on cells 310, 320, and 330, consideration is given to the relative timing to ensure that the resulting execution is in harmony with code 400. For example, external remote branch commands arriving at a target are arranged so that they do not prematurely terminate execution at the target processing cell. However, to simplify the explanation, detailed timing consideration is not explicitly mentioned in most cases in the following text.

For illustration purposes, execution of the example in FIG. 5 starts at processing cell 330. However, since the instruction “x=y/z” is to be executed on line b102 in processing cell 310, processing cell 330 issues appropriate commands for that to happen. Processing cell 330, on line b302, issues a command “rbr” to processing cell 320, which, in turn, sends another “rbr” command on line b202 to processing cell 310. Because processing cell 320 is on the west side of processing cell 330, processing cell 330 specifies the “out_w” parameter for the command to be sent to the west cell. Processing cell 330 also specifies the virtual branch target address “bb1′”. Processing cell 330 then continues to execute instructions “s=q*r”, “p=(s<3)”, and “lbr bb3 if not(p)”, on lines b304 to b310. These instructions correspond to lines 416, 418, 419, and 425 in FIG. 4.

Within processing cell 330, the “rbr” instruction on line b302 resides in instruction memory 140 at a local physical address, e.g., address la302. The instruction is issued when program counter 150 of processing cell 330 references address la302. Once issued, the instruction is forwarded to branch unit 160 for execution. Branch unit 160 identifies instruction “rbr” as a remote branch command, and thus forwards parameter “bb1′”, a literal in this “rbr” instruction, to operand format 170. At the same time, the “out_w” parameter from the instruction is forwarded directly from instruction memory 140 to output AND gates 190 as output steering bits to control output propagation. In this example, only the AND gate 190 that connects cell 330 to its western neighbor, i.e., cell 320, is enabled.
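The output steering described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of the embodiment: the port names "out_n" and "out_s", the `RemoteBranch` record, and the `steer` function are assumptions introduced here, and the prime marks of the block names are written as ASCII apostrophes.

```python
from dataclasses import dataclass

# Hypothetical port names; the text itself only names "out_w" and "out_e".
PORTS = ("out_n", "out_e", "out_s", "out_w")


@dataclass(frozen=True)
class RemoteBranch:
    target: str        # virtual branch target tag, e.g. "bb1'"
    data: tuple = ()   # optional data bundled into the command


def steer(command, enabled_ports):
    """Gate one command onto each output port, one AND gate per port.

    A port whose steering bit is set passes the command through;
    every other port outputs None (nothing is propagated).
    """
    return {port: command if port in enabled_ports else None for port in PORTS}


# "rbr (out_w, bb1')" issued by cell 330: only the west port is enabled.
outputs = steer(RemoteBranch("bb1'"), {"out_w"})
```

In this sketch, enabling more than one port in `enabled_ports` would model sending the same remote branch command to several neighbors at once.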

Instructions on lines b304 and b306 are issued in a similar fashion, but are forwarded to functional units within dedicated functional units 180. Finally, the instruction on line b310 is issued to branch unit 160, which conditionally executes it, i.e., if p is not true, then the local branch succeeds and the branch target address is sent to program counter 150. In some embodiments, branch unit 160 contains predicate registers to hold values such as p. The result of a compare instruction, such as the one on line b306, is thus forwarded from dedicated functional units 180 to branch unit 160 for storage. In other implementations, the value of p is stored in a general-purpose register file within dedicated functional units 180. In that case, the value p stored there is read out when a conditional branch instruction is issued, and forwarded to branch unit 160.

Regarding basic block B420, since the instruction “x=x*x” is to be executed by processing cell 310 at address bb2″ on line b106, processing cell 330 issues appropriate commands for that to happen. Execution on processing cell 330 enters block bb2 when the conditional local branch on line b310 results in an untaken branch, and local execution proceeds sequentially into block bb2. At address bb2 on line b316, processing cell 330 issues a command “rbr (out_w, bb2′)” to trigger execution at address bb2′ on processing cell 320. At address bb2′ on line b206, processing cell 320 in turn issues a command “rbr (out_w, bb2″)” to initiate execution at address bb2″ on processing cell 310. Finally, the instruction “x=x*x” on line b106 is executed at address bb2″ of processing cell 310.

Similar to the instructions “x=y/z” and “x=x*x,” since the instruction “x=2*x” on line b110 is to be executed by processing cell 310, processing cell 330 issues appropriate commands for that to happen. Execution on processing cell 330 is transferred to block bb3 when the conditional local branch command “lbr bb3 if not(p)” finds p to be false, and thus results in a taken branch. At the branch target bb3, processing cell 330 sends a command “rbr” to initiate execution at address bb3′ on processing cell 320. Processing cell 320 in turn issues a command “rbr” on line b212 to trigger execution at address bb3″ on processing cell 310, which then executes the instruction “x=2*x.”

In the above example, processing cells 320 and 310 do not participate in the local branching decision of processing cell 330, e.g., when processing cell 330 issues the command “lbr bb3 if not(p)” on line b310. Processing cell 320, at address bb1′ on line b202, relays information received from processing cell 330 to processing cell 310, without acting on the received information. In addition, in some cases, processing cell 320, as a receiving processing cell, also performs its own tasks, e.g., executing the instruction “w=w*w” on line b208, after initiating the remote branch command “rbr” on line b206. Processing cell 330 stops its execution by issuing the “halt” command on line b328.
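The relay chain for basic block B420 can be illustrated with a small, untimed Python sketch. The per-cell programs below are an abridged, assumed rendering of the bb2, bb2′, and bb2″ blocks (FIG. 5 itself is not reproduced here), the tuple encoding of instructions and the `run` function are hypothetical, and the static timing that the real schedule depends on is deliberately ignored.

```python
from collections import deque

# West neighbours in the three-cell row of FIG. 3: 330 -> 320 -> 310.
WEST_OF = {330: 320, 320: 310}

# Assumed, abridged code for basic block B420 only, keyed by virtual tag.
# Prime marks are written as ASCII apostrophes.
PROGRAMS = {
    330: {"bb2": [("rbr", "bb2'"), ("halt",)]},
    320: {"bb2'": [("rbr", "bb2''"), ("exec", "w=w*w"), ("halt",)]},
    310: {"bb2''": [("exec", "x=x*x"), ("halt",)]},
}


def run(start_cell, start_tag):
    """Follow remote branches westward and record which cell executes what."""
    trace = []
    pending = deque([(start_cell, start_tag)])
    while pending:
        cell, tag = pending.popleft()
        # A tag with no entry on this cell starts nothing (empty block).
        for op in PROGRAMS[cell].get(tag, []):
            if op[0] == "rbr":
                pending.append((WEST_OF[cell], op[1]))
            elif op[0] == "exec":
                trace.append((cell, op[1]))
            elif op[0] == "halt":
                break
    return trace


trace = run(330, "bb2")
```

Here cell 320 both relays the command and performs its own work, while cell 330 contributes only coordination code for this block.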

Block bb4″ showing no instruction indicates that processing cell 310 has no role in this block bb4″. As a result, performing a table lookup in processing cell 310 for block bb4″ will fail to find a match, in which case the remote branch command has no effect.
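A minimal sketch of this lookup behavior follows, assuming the lookup table can be modeled as a dictionary from virtual tags to local addresses. The addresses, the `HALTED` marker, and the function name are hypothetical and are not taken from the figures.

```python
HALTED = None  # assumed marker for a cell that is not currently executing


def on_remote_branch(lookup_table, program_counter, tag):
    """Return the program counter after a remote branch command arrives.

    A tag found in the table redirects execution to its local address;
    a tag with no match leaves the program counter unchanged.
    """
    address = lookup_table.get(tag)
    if address is None:
        return program_counter
    return address


# Hypothetical local addresses for cell 310; block bb4'' has no entry.
CELL_310_TABLE = {"bb1''": 0x100, "bb2''": 0x104, "bb3''": 0x108}

pc_after_hit = on_remote_branch(CELL_310_TABLE, HALTED, "bb2''")
pc_after_miss = on_remote_branch(CELL_310_TABLE, HALTED, "bb4''")
```

With these assumed values, the command for bb2″ starts execution at the mapped address, while the command for bb4″ leaves cell 310 halted.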

The above example also uses static timing analysis and scheduling, as opposed to dynamic synchronization, to ensure that execution of a block is completed before a new remote branch command arrives. For example, execution of block bb1″ of processing cell 310 is complete before the command to trigger execution of block bb2″ on the same processing cell 310 arrives. In coming up with an appropriate schedule, the compiler or a human coder may have to delay initiating a remote branch command to ensure that it does not arrive prematurely.

The above example also illustrates the concept of micro-threading. Whereas the original code 400 in FIG. 4 comprises one logical thread, the actual implementation utilizes up to three processing cells, with each processing cell executing one or more micro-threads. Generally, a micro-thread is triggered by an arriving remote branch command and terminates at a “halt” instruction. Thus FIG. 5 shows code representing one micro-thread on processing cell 330, four micro-threads on processing cell 320, and three micro-threads on processing cell 310.

SECOND EXAMPLE ILLUSTRATING DATA BEING TRANSFERRED FROM A PROCESSING CELL TO ANOTHER PROCESSING CELL

FIG. 6 shows a piece of code 600 that works with a processing system 700 to illustrate data being transferred from a processing cell to another processing cell. As compared to code 400, the instruction “x=x*x” on line 420 in FIG. 4 has been changed to “x=x*s” on line 620.

FIG. 7 shows a system 700 illustrating how code 600 is implemented on processing cells 310, 320, and 330 of system 300. In this example, variable “s” is calculated on line b404 in block bb1 of processing cell 330. However, variable “s” is also used on line b608 in block bb2″ of processing cell 310. Processing cell 330 thus issues commands to transfer the value of variable “s” from processing cell 330 to block bb2″ of processing cell 310. Consequently, processing cell 330 on line b416 issues a command “rbr (out_w, bb2′, [s])” that sends a remote branch command to trigger execution at address bb2′ on processing cell 320. In addition, the value of variable “s” is also bundled into the command “rbr.” Processing cell 320 on line b506 issues a command “b2_local_s=cmd_data(0)” to initialize the local variable b2_local_s with the value “s”, which is the 0th data parameter carried by the most recently arrived remote branch command to processing cell 320.

To further relay the value of variable “s” to processing cell 310, processing cell 320 on line b508 issues a command “rbr (out_w, bb2″, [b2_local_s])” to send a remote command, with the data value “b2_local_s”, to processing cell 310 where it triggers execution at address bb2″. Processing cell 310, on line b606, extracts the value of variable “s” with a “cmd_data” instruction, and stores that value in the local variable b1_local_s. Finally, this value is used in the instruction “x=x*b1_local_s” on line b608.

The instruction “cmd_data” is used to select data from among those stored in operand un-format 120. Upon receiving a remote branch instruction, the content of the command is parsed and stored in operand un-format 120. This includes any data bundled into the command. The “cmd_data” instruction picks the appropriate piece of data amongst those stored at operand un-format 120, and assigns it to a local general-purpose register. The data bundled into a remote branch command is viewed as forming an array, and the parameter of “cmd_data” specifies the array index of the data to extract.
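The array view of bundled data can be sketched as follows. The `OperandUnformat` class is a hypothetical software stand-in for operand un-format 120, the values chosen for s and x are arbitrary, and the variable names mirror those quoted from FIG. 7 in the text.

```python
class OperandUnformat:
    """Holds the contents of the most recently received remote branch command."""

    def __init__(self):
        self.target = None
        self.data = ()

    def receive(self, target, data=()):
        # A newly arrived command replaces whatever was stored before.
        self.target = target
        self.data = tuple(data)

    def cmd_data(self, index):
        # The bundled data is treated as an array; index selects one element.
        return self.data[index]


s = 42  # arbitrary value computed on cell 330

# Cell 330 issues rbr (out_w, bb2', [s]); cell 320 receives it.
cell_320 = OperandUnformat()
cell_320.receive("bb2'", [s])
b2_local_s = cell_320.cmd_data(0)

# Cell 320 relays the value with rbr (out_w, bb2'', [b2_local_s]).
cell_310 = OperandUnformat()
cell_310.receive("bb2''", [b2_local_s])
b1_local_s = cell_310.cmd_data(0)

x = 3  # arbitrary prior value of x on cell 310
x = x * b1_local_s
```

Because each cell reads from the most recently arrived command, this sketch also shows why the extraction must happen before the next remote branch command overwrites the stored data.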

Modes of Operation

Computing systems built using processing cells can support both the synchronous VLIW and the asynchronous MIMD modes of parallel execution. That is, embodiments of the invention allow control of branching behavior such that a single logical branch may affect a common set of processing cells that work in close harmony in the manner of a VLIW architecture, or each processing cell autonomously executes branches that affect only itself in the manner of a MIMD architecture. Embodiments of the invention support spatial partitioning, i.e., partitioning the processing cells in a system into subsets, each of which operates in either the VLIW or MIMD mode. Furthermore, the assignment of processing cells to subsets can be changed in as few as one cycle, which enables seamless switching between VLIW and MIMD modes of operation. For illustration purposes, a cluster refers to a number of processing cells.

VLIW Mode of Operation

When program tasks are known at compile time, and the target hardware operates with predictable execution time, the VLIW mode is commonly used. Under this mode of operation, the tasks, including those that run concurrently, are statically orchestrated in a synchronized manner. Because processing cells operate off clock signals that have known fixed timing relationships between them, and remote branch commands propagate and execute with predictable time, relative execution time of tasks assigned to different processing cells is statically predictable as long as operations executed in the dedicated and shared functional units take predictable time.

A cluster of processing cells operating in the VLIW mode can exchange data and share resources without using run-time synchronization. By taking advantage of static predictability of execution time, static orchestration can time the various read/write operations of a data exchange to ensure that the reader does not read prematurely. Similarly, when multiple processing cells use a shared resource such as a shared memory port, static orchestration plans the multiple accesses from different processing cells so that they do not collide, but instead occur at different, non-overlapping times.

Generally, a computation is decomposed into operations to be executed in parallel. A compiler schedules operations onto all processing cells in the VLIW cluster and their functional units, taking care that data is computed before it is used and that resources are not used for multiple purposes at the same time. Instructions are presented in parallel and in lock-step sequences across all functional units. Synchronization orchestrated at compile time is retained at run-time due to predictable execution time of each instruction and the known fixed timing relationship between the clocks of all processing cells in the VLIW cluster.

Normally, in the VLIW mode, a plurality of processing cells operates as a single logical processor and in a lock-step manner following a logical thread of execution. Each processing cell, however, may have a different program schedule. Thus, while each processing cell may have a different role to execute the program, they collaborate according to a common clock. The above examples in FIG. 4 and FIG. 6 are examples of this mode of operation. In each example, a single logical thread is mapped onto three processing cells, e.g., 310, 320, and 330. While the three processing cells 310, 320, and 330 collaborate tightly, each runs its own code.

Usually, program collaboration uses the ability to statically determine, e.g., at compile time, the relative rate of program execution on different processing cells. With this knowledge, the compiler, for example, plans execution on different processing cells so that values produced on one processing cell are made available in time for use by another processing cell. Systems with this kind of static predictability are commonly referred to as co-synchronous.

When a processing cell operates in the VLIW mode, it operates with other processing cells of the system as a single cluster. In this mode, a branch instruction generated within an originating processing cell is used to cause other processing cells within the common cluster to branch to predictable program locations that can be statically scheduled by a VLIW compiler. Since processing cells are driven by a common clock signal, the processing system can be engineered to move in lock-step harmony.

In the full VLIW mode, the system comprises only one cluster, with all processing cells constituting that cluster. For example, a system with ten processing cells has one cluster, in which all ten processing cells constitute the cluster.

Systems using processing cells may also operate in multiple VLIW modes. For example, a portion of the system may be operating in a first VLIW mode and another portion may be operating in a second VLIW mode. Execution in the first portion or cluster is independent of execution in the second cluster; however, execution in each cluster is lock-step.

MIMD Mode of Operation

Normally, the MIMD mode of parallel operation is used when program tasks are difficult to predict. In the MIMD mode, a computation is divided into multiple logical threads of execution that operate asynchronously. Run-time synchronization is desirable to coordinate the different threads, such as when data is exchanged between the logical threads, or the multiple logical threads attempt to access shared resources, such as a shared memory port. Examples of traditional run-time synchronization techniques include semaphores, barriers, monitors, etc.

Generally, a MIMD computing system comprises multiple clusters of processing cells. In the full MIMD mode, each cluster comprises a single processing cell, and consequently, the number of clusters equals the number of processing cells. For example, in a system with 10 processing cells, there are 10 clusters, each with one processing cell. Each processing cell thus operates as a separate processor, generates separate branch target addresses, and can independently branch at arbitrary moments in time. In general, each cluster of a MIMD mode execution may itself comprise multiple processing cells operating in a VLIW mode.

Processing systems using processing cells may support mixtures of both VLIW and MIMD, i.e., some processing cells operate in the MIMD mode and some others operate in the VLIW mode. The invention is not limited to any particular number of clusters or number of processing cells in a cluster. For example, a system with 10 processing cells may have three clusters with 3, 3, and 4 processing cells, or 2, 3, and 5 processing cells, etc. Alternatively, the system may include two clusters with 4 and 6 processing cells, or 3 and 7 processing cells, etc. Within a cluster, processing cells operate in the VLIW mode with respect to each other, and, between clusters, processing cells operate in MIMD mode with respect to each other.
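The relationship between cluster sizes and operating modes can be summarized with a short sketch. The function name and the returned labels are illustrative assumptions; the text defines the full VLIW and full MIMD cases, and everything in between is labeled "mixed" here for convenience.

```python
def classify(cluster_sizes, num_cells):
    """Label a partition of num_cells processing cells into clusters.

    One cluster holding every cell is the full VLIW mode; one cell per
    cluster is the full MIMD mode; anything else is a mixture (VLIW within
    each cluster, MIMD between clusters).
    """
    if any(size < 1 for size in cluster_sizes) or sum(cluster_sizes) != num_cells:
        raise ValueError("cluster sizes must be positive and cover every cell")
    if len(cluster_sizes) == 1:
        return "full VLIW"
    if all(size == 1 for size in cluster_sizes):
        return "full MIMD"
    return "mixed"


# Partitions of a ten-cell system mentioned in the text.
modes = [classify(sizes, 10) for sizes in ([10], [1] * 10, [3, 3, 4], [4, 6])]
```

A system with a single processing cell satisfies both extreme definitions; this sketch reports it as "full VLIW", which matches the degenerate VLIW case discussed for the third example.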

THIRD EXAMPLE ILLUSTRATING MIMD AND VLIW MODES OF OPERATION AND TRANSITIONING BETWEEN THE MODES

FIG. 8 shows a parallel program 800 that first sorts two arrays A and B, and then multiplies corresponding elements of the two arrays, leaving the results in a third array C.

FIG. 9 shows a processing system 900 executing program 800, in accordance with an embodiment. Processing system 900 includes two processing cells 910 and 920, two RAM blocks 930(1) and 930(2), and two synchronization registers 950(1) and 950(2). Processing cells 910 and 920 share RAM blocks 930(1) and 930(2). For illustration purposes, each RAM block 930 has one port capable of performing both read and write operations, and, in each clock cycle, each RAM block 930 performs one memory operation. Processing cells 910 and 920 sharing RAM blocks 930 seek to resolve potential access conflicts, e.g., either through static scheduling when they operate under the VLIW mode, or through dynamic synchronization when they operate under the MIMD mode. Processing cells 910 and 920 also share two synchronization registers 950(1) and 950(2). Each synchronization register 950 has a read port and a write port, each of which can be accessed once in each cycle. A write in cycle t is visible to a read performed in the next cycle, e.g., t+1. A read that occurs in the same cycle t gets the value previously stored in the register. Processing cell 910 can write register 950(1) and read register 950(2), while processing cell 920 can write register 950(2) and read register 950(1).

When execution begins, arrays A and C are stored in RAM block 930(1) and array B is stored in RAM block 930(2), and initial execution occurs on processing cell 910. Computations happen in two phases. Phase one is triggered when processing cell 910 sends a remote branch command to processing cell 920 to trigger execution at Y-phase1, while execution on processing cell 910 continues sequentially into X-phase1. During phase one, the execution occurs in an MIMD mode, with processing cells 910 and 920 performing a quick-sort on arrays A and B, respectively. Processing cell 910 accesses RAM block 930(1) while processing cell 920 accesses RAM block 930(2) during this phase. Consequently, there is no conflict for accesses to the two RAM blocks 930. At the end of the first phase, processing cells 910 and 920 perform dynamic synchronization using synchronization register 950(2), which is initialized to zero before the beginning of phase one execution. When processing cell 910 finishes phase one execution, it repeatedly checks the value of register 950(2) until it finds a one in register 950(2). Conversely, when processing cell 920 finishes phase one execution, it writes a one into register 950(2) and halts. When processing cell 910 finds a one in register 950(2), processing cell 910 knows that processing cell 920 has completed phase one. Processing cell 910 then initiates phase two execution by sending a remote branch command to processing cell 920.
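The end-of-phase handshake can be modeled cycle by cycle. The sketch below is a simplified simulation under stated assumptions: the `SyncRegister` class and `first_cycle_910_sees_one` function are hypothetical, cell 920 is assumed to write the one in the same cycle in which it finishes phase one, and cell 910 is assumed to start polling in the cycle in which it finishes.

```python
class SyncRegister:
    """One-bit register where a write in cycle t is visible to reads in t+1."""

    def __init__(self, initial=0):
        self._value = initial
        self._pending = None

    def read(self):
        # A read in the same cycle as a write still sees the old value.
        return self._value

    def write(self, value):
        self._pending = value

    def tick(self):
        # End of cycle: commit any write made during this cycle.
        if self._pending is not None:
            self._value = self._pending
            self._pending = None


def first_cycle_910_sees_one(finish_910, finish_920):
    """Cycle in which polling cell 910 first reads a one from register 950(2)."""
    if finish_910 < 0 or finish_920 < 0:
        raise ValueError("finish cycles must be non-negative")
    reg_950_2 = SyncRegister(0)   # initialized to zero before phase one
    cycle = 0
    while True:
        if cycle >= finish_910 and reg_950_2.read() == 1:
            return cycle          # cell 910 may now start phase two
        if cycle == finish_920:
            reg_950_2.write(1)    # cell 920 signals completion, then halts
        reg_950_2.tick()
        cycle += 1


seen_at = first_cycle_910_sees_one(finish_910=3, finish_920=5)
```

Under these assumptions the handshake completes at the later of cell 910's own finish cycle and one cycle after cell 920's write, whichever cell finishes sorting first.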

Phase two execution multiplies corresponding elements of arrays A and B to produce elements of array C. Processing cell 910 performs the multiplication for elements in the first half of the arrays, while processing cell 920 performs the multiplication for elements in the second half of the arrays. Both cells 910 and 920 write their results directly into array C. The second phase execution occurs under VLIW mode and takes advantage of static scheduling of VLIW mode to statically orchestrate the memory accesses.

FIG. 10 shows a table 1000 illustrating a schedule for accessing RAM blocks 930, in accordance with an embodiment. To avoid clutter, looping details, such as loop index increment and loop termination testing, are left out. In that schedule, there is at most one memory access to RAM 930(1) in each clock cycle. Similarly, there is at most one memory access to RAM 930(2) in each clock cycle.
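The property that such a schedule must satisfy, at most one access per RAM block per cycle, is easy to check mechanically. The sketch below uses a made-up two-cell schedule for a single loop iteration; it is not the table of FIG. 10, and the `conflicts` function and the tuple layout are assumptions for illustration.

```python
from collections import Counter


def conflicts(schedule):
    """Return the (cycle, ram) pairs that are accessed more than once.

    schedule is an iterable of (cycle, cell, ram) tuples, one per access.
    An empty result means every single-ported RAM block sees at most one
    access in each clock cycle.
    """
    counts = Counter((cycle, ram) for cycle, _cell, ram in schedule)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical schedule: A and C live in RAM 930(1), B lives in RAM 930(2).
staggered = [
    (0, 910, "930(1)"),  # cell 910 reads A[i]
    (1, 910, "930(2)"),  # cell 910 reads B[i]
    (2, 910, "930(1)"),  # cell 910 writes C[i]
    (0, 920, "930(2)"),  # cell 920 reads B[j]
    (1, 920, "930(1)"),  # cell 920 reads A[j]
    (3, 920, "930(1)"),  # cell 920 writes C[j]
]
# The same accesses without staggering: both cells read A in cycle 0.
colliding = [(0, 910, "930(1)"), (0, 920, "930(1)")]

staggered_conflicts = conflicts(staggered)
colliding_conflicts = conflicts(colliding)
```

A compiler or human coder could run a check of this kind over a candidate schedule before committing to it, since no run-time arbitration is available in the VLIW mode.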

To achieve the schedule in FIG. 10, the code generated for processing cells 910 and 920 counts the cycles of various operations, starting from the point where processing cell 910 initiates the command rbr(out_e, Y-phase2). That point serves as a common reference time from which the two sequences of actions and their timings are followed. No-ops are inserted in the code of processing cells 910 and 920 as appropriate so that the resulting code exhibits the relative timing indicated in the schedule of FIG. 10 when the cells execute the loop that generates elements of array C by multiplying elements of arrays A and B.
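No-op insertion of this kind can be sketched as a simple padding step. The function below is an illustration only: it assumes every instruction, including a no-op, occupies exactly one cycle, that cycle 0 is the common reference time, and that the operation strings are placeholders rather than instructions from FIG. 10.

```python
def pad_with_noops(timed_ops):
    """Turn (cycle, op) pairs into straight-line code with no-ops inserted.

    timed_ops must be sorted by strictly increasing cycle, counted from the
    common reference time. Each emitted instruction is assumed to take one
    cycle, so the op scheduled for cycle c ends up at position c.
    """
    code = []
    for cycle, op in timed_ops:
        if cycle < len(code):
            raise ValueError("operations must be in strictly increasing cycle order")
        code.extend(["nop"] * (cycle - len(code)))
        code.append(op)
    return code


# Placeholder operations for one cell, scheduled at cycles 1 and 3.
padded = pad_with_noops([(1, "load A[i]"), (3, "store C[i]")])
```

Applying the same step to each cell's operation list, with cycles taken from a shared schedule, yields code whose relative timing matches that schedule without any run-time synchronization.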

The example of FIG. 9 shows an example of MIMD mode execution and how that is started from a degenerate VLIW mode of execution involving only one processing cell. The example then shows how the MIMD mode of execution ends through dynamic synchronization, and then transitions into the VLIW mode of execution involving multiple processing cells, e.g., two processing cells 910 and 920. The example also illustrates how VLIW mode of execution involves static scheduling that ensures that concurrent accesses to shared resources do not result in any conflicts.

Configuring Modes of Operation

Mode reconfiguration is generally done through program execution. For example, a single VLIW thread running on a cluster of multiple processing cells might undergo a fission process. A remote branch multicast to processing cells within the VLIW cluster may initiate execution that ends the close, synchronous collaboration between the processing cells. The example of FIG. 9 undergoes a process very similar to this as it enters phase one execution. However, in that example, the VLIW mode of execution prior to phase one is degenerate in that it utilizes only one processing cell. Those skilled in the art will recognize that other examples very similar to that in FIG. 9 can show that the initial VLIW mode of execution can utilize both processing cells 910 and 920. After dynamic reconfiguration is performed, the cluster has been divided into a plurality of processing cell clusters.

As another example, multiple threads operating on a plurality of clusters might undergo a fusion process. An example is the transition from the MIMD mode of execution into the VLIW mode of execution illustrated by the example in FIG. 9 as it transitions from phase one to phase two execution. Thus, after reconfiguration is performed, the plurality of clusters are merged into a single large cluster.

In the foregoing specification, the invention has been described with reference to specific embodiments thereof. However, it will be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded as illustrative rather than as restrictive.

1. A computing system, comprising: a plurality of processing cells including a first processing cell having a plurality of output ports, and a second processing cell having a plurality of input ports coupled to the plurality of output ports of the first processing cell; wherein a remote instruction that is transmitted by the first processing cell and that is received by the second processing cell redirects program execution of the second processing cell to an execution address sent by the first processing cell via the remote instruction; the remote instruction propagates from within the first processing cell to at least one selected output port of the first processing cell; and the selected output port is configured for transporting the remote instruction to a processing cell of the plurality of processing cells based on whether that processing cell is intended to receive that remote instruction via that selected output port.
2. The computing system of claim 1 wherein an output port of the output ports of the first processing cell is associated with a bit that controls configurations of that output port.
3. The computing system of claim 1 wherein program execution on the first processing cell and the second processing cell run on clocks with known timing relationships.
4. The computing system of claim 3 wherein execution of instructions of the first processing cell and the second processing cell is scheduled based on known timing relationships between those instructions.
5. The computing system of claim 3 wherein the first processing cell and the second processing cell operate in the very-long-instruction-word mode based on known timing relationship between the clocks run by the first processing cell and the second processing cell, and predictability of execution of instructions of the first processing cell and the second processing cell.
6. The computing system of claim 1 wherein at least one processing cell of the plurality of processing cells operates autonomously as an independent processor.
7. The computing system of claim 1 wherein a processing cell of the plurality of processing cells includes an input combiner for receiving remote instructions from at least one other processing cell and for selecting a remote instruction of the remote instructions to be propagated into the processing cell.
8. The computing system of claim 1 wherein a processing cell of the plurality of processing cells includes: an instruction memory that stores instructions to be executed, and a functional unit that performs an operation in response to an instruction taken from the instruction memory.
9. The computing system of claim 8 wherein the processing cell of the plurality of processing cells further includes a lookup table that converts a target portion of a remote instruction received from at least one other processing cell to an address of the instruction memory.
10. The computing system of claim 1 wherein a processing cell of the plurality of processing cells includes: a state machine that manages executions of the processing cell, and a functional unit that performs an operation in response to control signals from the state machine.
11. The computing system of claim 1 wherein the second processing cell forwards data associated with the remote instruction to a third processing cell.
12. The computing system of claim 1 wherein: a portion of the plurality of processing cells is arranged into at least one cluster that comprises at least one processing cell that operates co-synchronously with other processing cells of the cluster such that execution timing relationships between processing cells of the cluster are statically predictable, and latencies in distributing and processing remote instructions issued from a processing cell within the cluster to another cell within the cluster is predictable, and thereby permits static scheduling of processing cells within the cluster.
13. The computing system of claim 1 wherein the plurality of processing cells is arranged in a plurality of clusters each of which operates in either a very-long-instruction-word mode or a multiple-instructions-multiple-data mode.
14. The computing system of claim 1 wherein: the plurality of processing cells includes a first cluster comprising a first plurality of processing cells and a second cluster comprising a second plurality of processing cells, and during execution of the first cluster and the second cluster, a processing cell of the first cluster is configured to be used by the second cluster.
15. A computing system comprising: a plurality of processing cells that are arranged in a plurality of clusters wherein program executions are synchronous within a cluster and are asynchronous between clusters; wherein a remote instruction that is issued by a first processing cell and that is received by a second processing cell redirects execution of the second processing cell to an execution address sent by the first processing cell; and the remote instruction propagates within the first processing cell to a selected output port of the first processing cell; and the selected output port is configured for transporting the remote instruction to a processing cell of the plurality of processing cells based on whether that processing cell is intended to receive the remote instruction via the selected output port.
16. A first processing cell for use in a computing system having a plurality of processing cells, comprising: an input combiner for receiving remote instructions from a processing cell of the plurality of processing cells; a lookup table for translating a virtual target name embedded in a remote instruction to a physical local target address to be used in instruction execution of the first processing cell; an instruction controller for controlling destinations of instructions processed by the first processing cell; a remote instruction generator for generating instructions for use by a second processing cell; at least one functional unit for processing the instructions processed by the first processing cell; and output ports for sending the instructions processed by the first processing cell for use by the second processing cell; wherein an output port is configured for transporting an instruction for use by the second processing cell based on whether the second processing cell is intended to receive that instruction via that output port.
17. The first processing cell of claim 16 wherein the input combiner selects a remote instruction of the remote instructions based on a priority associated with the remote instruction.
18. The first processing cell of claim 16 wherein: a remote instruction sent from the first processing cell to the second processing cell and a third processing cell triggers execution at a memory address of the second processing cell and at a memory address of the third processing cell; and the memory address of the second processing cell differs from the memory address of the third processing cell.
19. The first processing cell of claim 16 wherein: the virtual target name is sent from a third processing cell to the first processing cell.
20. The first processing cell of claim 16 further comprising a latch for use in pipelining the instructions processed by the first processing cell.
21. The first processing cell of claim 16 wherein the instruction controller includes: a program counter for keeping track of instructions to be issued; an instruction memory for storing the instructions processed by the first processing cell; and a branch unit for determining destination and effect of an instruction for use by the second processing cell.
22. The first processing cell of claim 16 wherein the instruction controller includes a state machine for controlling operations in the first processing cell.
23. The first processing cell of claim 16 wherein: an instruction of the instructions processed by the first processing cell is sent to the second processing cell via at least one output port of the first processing cell.
24. The first processing cell of claim 16 wherein: an instruction processed by the first processing cell is triggered by a remote instruction received from a third processing cell via at least one output port of the third processing cell.